docs: Add GEPA prompt optimization and human agreement specs by vivian-xie-db · Pull Request #95 · databricks-solutions/project-0xfffff

vivian-xie-db · 2026-02-13T00:53:57Z

Summary

PROMPT_OPTIMIZATION_SPEC.md: Declarative spec for the GEPA prompt optimization pipeline — covers MLflow optimize_prompts API, training data from annotated traces, score normalization, predict_fn behavior, custom endpoint support, config persistence, auto-reconnect, and score improvement display.
HUMAN_AGREEMENT_SPEC.md: Declarative spec for GDPVal A^HH human-to-human agreement metric — covers the formula E[1 - |H_1 - H_2|], rating normalization (Likert/binary), pairwise agreement %, per-metric breakdowns, and IRR integration.
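
As a rough illustration of the A^HH formula named above, here is a minimal sketch. It assumes a 1-5 Likert scale mapped linearly onto [0, 1], binary ratings already stored as 0/1, and pairwise agreement % defined as the share of rater pairs with identical ratings. The function names (`normalize_likert`, `human_agreement`, `pairwise_agreement_pct`) are hypothetical and are not taken from the spec or the implementation.

```python
from itertools import combinations
from statistics import mean


def normalize_likert(rating, scale_min=1, scale_max=5):
    """Map a Likert rating onto [0, 1] (assumed linear normalization)."""
    return (rating - scale_min) / (scale_max - scale_min)


def _rater_pairs(ratings_by_item):
    """Collect every pair of human ratings given to the same item."""
    return [
        (a, b)
        for ratings in ratings_by_item
        for a, b in combinations(ratings, 2)
    ]


def human_agreement(ratings_by_item):
    """Estimate A^HH = E[1 - |H_1 - H_2|] over all same-item rater pairs.

    Ratings are expected to be normalized to [0, 1] already.
    """
    pairs = _rater_pairs(ratings_by_item)
    if not pairs:
        raise ValueError("need at least one item with two or more ratings")
    return mean(1 - abs(a - b) for a, b in pairs)


def pairwise_agreement_pct(ratings_by_item):
    """Percentage of same-item rater pairs with identical ratings (assumed definition)."""
    pairs = _rater_pairs(ratings_by_item)
    if not pairs:
        raise ValueError("need at least one item with two or more ratings")
    return 100 * sum(a == b for a, b in pairs) / len(pairs)


# Three items, each rated by two humans on a 1-5 Likert scale.
raw = [[5, 4], [3, 3], [1, 5]]
normalized = [[normalize_likert(r) for r in item] for item in raw]
print(human_agreement(normalized))
print(pairwise_agreement_pct(normalized))
```

For the three items in this example, the per-pair scores are 0.75, 1.0, and 0.0, so A^HH is their mean, and only one of the three pairs agrees exactly.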

Test plan

Specs are well-formed markdown and render correctly on GitHub
Content matches current implementation behavior

🤖 Generated with Claude Code

Add declarative specifications for two key features:
- PROMPT_OPTIMIZATION_SPEC.md: GEPA optimizer pipeline, training data, score normalization, predict_fn, config persistence, auto-reconnect
- HUMAN_AGREEMENT_SPEC.md: GDPVal A^HH human-to-human agreement metric, rating normalization, pairwise agreement %, IRR integration

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

FMurray · 2026-02-13T01:08:24Z

IRR metrics are currently crammed into the judge evaluation spec, where they probably don't belong. We should move the IRR section out and merge it into the human agreement spec you've added here.

FMurray · 2026-02-13T01:13:24Z

+|--------|-----------------|-----------------|
+| **GDPVal A^HH** (this spec) | Human vs Human agreement | IRR Results page |
+| Pairwise Agreement % | Human vs Human agreement (percentage) | IRR Results page |
+| Cohen's Kappa | Judge vs Human agreement | Judge Tuning page |


Doesn't GDPval use A^HA as well? Why are we still using Cohen's Kappa?

Cohen's Kappa is on the Judge Tuning page. A small section above the evaluation results shows the Cohen's Kappa score, and it has been there since version 1.0.

…el in judge evaluation spec

Remove the generic IRR section (Krippendorff's Alpha, Cohen's Kappa for rater pairs) from JUDGE_EVALUATION_SPEC and replace it with a detailed Cohen's Kappa Metrics Panel spec covering the judge-vs-human agreement metrics displayed after evaluation on the Judge Tuning page.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

vivian-xie-db requested a review from forrestmurray-db February 13, 2026 00:58

FMurray reviewed Feb 13, 2026

vivian-xie-db wants to merge 2 commits into main from add_optmization_human_agreement_spec
